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Abstract 

This  paper  presents  a  new  algorithm  for  the  independent  components  analysis  (ICA) 
problem  based  on  efficient  entropy  estimates.  Like  many  previous  methods,  this  algorithm 
directly  minimizes  the  measure  of  departure  from  independence  according  to  the  estimated 
Kullback-Leibler  divergence  between  the  joint  distribution  and  the  product  of  the  marginal 
distributions.  We  pair  this  approach  with  efficient  entropy  estimators  from  the  statistics 
literature.  In  particular,  the  entropy  estimator  we  use  is  consistent  and  exhibits  rapid 
convergence.  The  algorithm  based  on  this  estimator  is  simple,  computationally  efficient, 
intuitively  appealing,  and  outperforms  other  well  known  algorithms.  In  addition,  the  esti¬ 
mator’s  relative  insensitivity  to  outliers  translates  into  superior  performance  by  our  ICA 
algorithm  on  outlier  tests.  We  present  favorable  comparisons  to  the  Kernel  ICA,  FAST- 
ICA,  JADE,  and  extended  Infomax  algorithms  in  extensive  simulations. 

1.  Introduction 

We  present  a  new  independent  components  analysis  (ICA)  algorithm,  RADICAL.  Empir¬ 
ical  results  indicate  that  it  outperforms  a  wide  array  of  well  known  algorithms.  Several 
straightforward  principles  under ly  the  development  of  RADICAL: 

1.  Since  ICA  is,  by  definition,  about  maximizing  statistical  independence,  we  attempt 
to  directly  optimize  a  measure  of  statistical  independence,  rather  than  a  surrogate  for 
this  measure. 

2.  We  avoid  explicit  estimation  of  probability  densities  as  an  intermediate  step.  Indeed, 
given  the  formulation  of  the  objective  function,  density  estimation  (even  implicitly) 
is  entirely  unnecessary. 

3.  Since  our  objective  function  involves  one-dimensional  entropy  estimation,  we  employ  a 
well-known^,  consistent,  rapidly  converging  and  computationally  efficient  estimator  of 

1.  Although  the  estimator  we  use  (Vasicek  (1976))  has  been  extensively  analyzed  in  the  statistics  literature, 

it  appears  to  be  relatively  unknown  in  the  machine  learning  community. 
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entropy  which  is  robust  to  outliers.  For  this  task,  we  turned  to  the  statistics  literature, 
where  entropy  estimators  have  been  studied  extensively  (c.f.  Beirlant  et  al.  (1997)). 

4.  As  the  optimization  landscape  has  potentially  many  local  minima,  we  eschew  gradient 
descent  methods.  The  fact  that  the  estimator  is  computationally  efficient  allows  for 
a  global  search  method.  The  properties  of  the  ICA  problem  allow  extension  of  this 
technique  to  higher  dimensions  in  a  tractable  manner. 

Attention  to  these  principles  led  to  the  Robust,  Accurate,  Direct  ICA  aLgorithm  (RADI¬ 
CAL)  presented  here. 

The  paper  is  organized  as  follows.  We  begin  by  setting  up  the  problem  and  discussing 
aspects  of  the  contrast  function  originally  proposed  by  Comon  (1994)  which  can  be  simplified 
to  a  sum  of  one-dimensional  marginal  entropies  (c.f.  Kullback  (1959)).  In  Section  2,  we 
discuss  entropy  estimates  based  on  order  statistics.  One  method  in  particular  satisfies  our 
requirements,  the  m-spacing  estimator  (Vasicek,  1976). 

This  entropy  estimator  leads  naturally  to  a  simple  ICA  algorithm,  a  two-dimensional 
version  of  which  is  described  in  Section  3.  We  follow  this  with  a  discussion  of  complexity 
and  experimental  results  for  the  two-dimensional  problem.  In  addition  to  working  for  a 
wide  variety  of  possible  source  distributions,  we  demonstrate  that  RADICAL  has  excellent 
robustness  to  the  presence  of  outliers.  In  Section  4,  we  extend  the  algorithm  to  higher 
dimensional  problems,  with  experiments  in  up  to  16  dimensions.  We  discuss  additional 
details  of  the  algorithm  and  discuss  why  the  complexity  is  still  manageable,  even  using  our 
exhaustive  search  approach. 

1.1  Linear  ICA  and  KL  divergence 

As  articulated  by  Comon  (1994),  independent  components  analysis  (ICA)  or  alternatively 
blind  source  separation  as  applied  to  instantaneous  linear  mixtures  considers  the  generative 
model  of  random  observations 

A  =  AS.  (1) 

Here  X  G  and  S  G  are  random  vectors,  and  A  G  is  a  fixed  but  unknown 

(hence  the  term  blind)  mixing  matrix.  Typically,  at  a  minimum,  one  assumes  that 

1.  the  mixing  matrix  A  has  full  rank, 

2.  the  components  of  S  are  mutually  independent,  and 

3.  C>D. 

Condition  2  is  equivalent  to  the  statement  that  the  joint  density  of  the  components  of  S 
can  be  expressed  as  a  product  of  the  marginals: 


D 

p(5i,---  ,5zr)  =  nP(^*)-  (2) 

i=l 

Additionally,  here  we  shall  restrict  ourselves  to  the  case  where  C  =  D  (i.e.  A  is  square) 
without  loss  of  generality.  The  goal  is  to  recover  (in  some  sense)  the  sources  and  perhaps 
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the  mixing  matrix  via  a  transformation  W  on  observations  of  X,  that  is 


Y  =  WX  (3) 

=  WAS  (4) 

=  BS.  (5) 

Given  the  minimal  statement  of  the  problem,  it  has  been  shown  (Benveniste  et  ah,  1980, 
Comon,  1994)  that  one  can  recover  the  original  sources  up  to  a  scaling  and  permutation 
provided  that  at  most  one  of  the  underlying  sources  is  Gaussian  and  the  rest  are  non- 
Gaussian.  Upon  pre-whitening  the  observed  data  the  problem  reduces  to  a  search  over 
rotation  matrices  in  order  to  recover  the  sources  and  mixing  matrix  in  the  sense  described 
above  (Hyvarinen,  2001,  Bach  and  Jordan,  2002).  We  will  assume  henceforth  that  such 
pre-processing  has  been  done. 

While  the  problem  statement  is  fairly  straightforward  with  a  minimum  of  assumptions 
it  has  been  well  studied,  resulting  in  a  vast  array  of  approaches.  Some  of  the  more  notable 
approaches  can  be  roughly  grouped  into  maximum  likelihood  based  methods  (Pham  et  ah, 
1992,  Pearlmutter  and  Parra,  1996),  moment /cumulant  based  methods  (Gomon,  1994,  Gar- 
doso  and  Souloumiac,  1996,  Gardoso,  1999b,  Hyvarinen,  2001),  entropy  based  methods  (Bell 
and  Sejnowski,  1995,  Hyvarinen,  1999),  and  correlation  based  methods  (Jutten  and  Herault, 
1991,  Bach  and  Jordan,  2002). 

Many  approaches  start  the  analysis  of  the  problem  by  considering  the  contrast  function 
(Gomon,  1994) 


KL  ,yD)\\Y[v{yi 


^H(F,)-H(Fi,...  ,Yd). 


Here  dy  =  dyidy2  ■  ■  ■  dyn  and  H{Y)  is  the  differential  entropy  (Shannon,  1948)  of  the  con¬ 
tinuous  multi-dimensional  random  variable  Y .  The  right  side  of  (6)  is  the  Kullback-Leibler 
divergence  (Kullback,  1959),  or  relative  entropy,  between  the  joint  density  of  {li, . . .  jTd} 
and  the  product  of  its  marginals. 

The  utility  of  (6)  for  purposes  of  the  IGA  problem  has  been  well  documented  in  the  liter¬ 
ature  (c.f.  Gomon  (1994),  Lee  et  al.  (1999a)).  Briefly  we  note  that  for  mutually  independent 
random  variables  Yi^Y^,  ...lYjy  we  have: 


J{Y)  =  /  p(2/i, 2/2,  •••,2/d)  log 


p(2/i, 2/2,  •••,2/d) 
p{yi)p{y2)-p{yD) 


=  yp(2/i, 2/2,  •••,2/d)  log  1  d/i 


=  0.  (11) 

Since  this  quantity  will  be  0  if  and  only  if  all  of  the  variables  are  mutually  independent^  we 
take  (6)  as  a  direct  measure  of  mutual  independence. 
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As  a  function  of  X  and  W  it  is  easily  shown  (c.f.  (Cover  and  Thomas,  1991,  Bell  and 
Sejnowski,  1995,  Hyvarinen,  2001))  that 

D 

J{Y)  =  Y,H{Yi) log  {\W\).  (12) 

i=l 

In  particular,  the  change  in  the  entropy  of  the  joint  distribution  under  linear  transformation 
is  simply  the  logarithm  of  the  Jacobian  of  the  transformation.  As  we  will  assume  the  A’s 
are  pre-whitened,  W  will  be  restricted  to  rotation  matrices  (i.e.  log(|IT|)  =  0)  and  the 
minimization  of  J(Y)  reduces  to  finding 

W*  =  aigmmH{Yi)  +  ---  +  H{YD).  (13) 

w 

The  preceding  development  was  necessary  to  bring  us  to  the  primary  contribution  of 
this  paper.  The  observations  noted  in  the  development  are  the  collective  contributions 
of  the  cited  authors.  As  has  been  noted  (Hyvarinen,  1999),  ICA  algorithms  consist  of 
an  objective  (contrast)  function  and  an  optimization  algorithm.  We  adopt  the  previously 
proposed  objective  criterion  of  (13)  and  present  a  means  of  both  estimating  its  value  and 
optimizing  the  choice  of  W  via  a  method  which  is  reliable,  robust,  and  computationally 
efficient.  These  are  the  aspects  of  our  proposed  approach  which  will  be  the  subject  of  the 
rest  of  the  paper. 

Towards  that  end,  we  adopt  a  different  entropy  estimator  to  minimize  (13).  The  entropy 
estimator  is  almost  identical  to  one  described  by  Vasicek  (1976)  and  others  (c.f.  Beirlant 
et  al.  (1997)  for  a  review)  in  the  statistics  literature.  This  class  of  entropy  estimators  has 
not  heretofore  been  applied  to  the  ICA  problem.  As  we  will  show,  the  use  of  this  entropy 
estimator  has  a  significant  impact  on  performance  as  compared  to  other  ICA  algorithms 
and  as  discussed  in  the  sections  on  experimental  results.  In  addition,  it  has  the  following 
desirable  properties: 

•  It  is  consistent. 

•  It  converges  as  the  square  root  of  N,  the  number  of  data  points,  and  is  asymptotically 
efficient. 

•  It  is  computable  in  0{N  log  N)  time. 

In  the  next  section,  we  present  a  detailed  discussion  of  the  entropy  estimator  and  its  prop¬ 
erties. 

2.  Entropy  Estimators  for  Continuous  Random  Variables 

There  are  a  variety  of  ICA  algorithms  that  minimize  (13)  to  find  the  independent  com¬ 
ponents  (e.g.  Comon  (1994)).  These  algorithms  differ  mostly  in  how  they  estimate  the 
entropy  of  the  one-dimensional  marginal  variables.  Hyvarinen  (1997),  for  example,  devel¬ 
oped  a  new  entropy  estimator  for  this  purpose.  RADICAL  also  uses  entropy  minimization 
at  its  core,  and  as  such  must  estimate  the  entropy  of  each  marginal  for  each  possible  W 
matrix.  RADICAL’S  marginal  entropy  estimates  are  functions  of  the  order  statistics  of 
those  marginals. 
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2.1  Order  statistics  and  spacings 

Consider  a  one-dimensional  random  variable  Z,  and  a  random  sample  of  Z  denoted  by 
The  order  statistics  of  a  random  sample  of  Z  are  simply  the  elements  of 
the  sample  rearranged  in  non-decreasing  order:  <  Z^‘^'>  <  ...  <  Z^^^  (c.f.  Arnold 

et  al.  (1992)).  A  spacing  of  order  m,  or  m-spacing,  is  then  defined  to  be  —  Z^*\  for 

l<i<i-\-m<N.  Finally,  if  m  is  a  function  of  N,  one  may  define  the  m-spacing  as 

The  mAT— spacing  estimator  of  entropy,  originally  due  to  Vasicek  (1976),  can  now  be 
defined  as 


N-mpf  .  s 

Hn{Z\Z\...,Z^)  =  -  V  log  - (z(*+-^^)-Z«)  .  (14) 

N  ^  \mN  ) 

This  estimator  is  nearly  equivalent  to  the  one  used  in  RADICAL,  which  is  discussed  below. 
To  see  where  this  estimator  comes  from,  we  first  make  the  following  observation  regarding 
order  statistics.  For  any  random  variable  Z  with  an  impulse-free  density  p{-)  and  continu¬ 
ous  distribution  function  P{-),  the  following  holds.  Let  p*  be  the  N-way  product  density 
p*{Z\Z^,...,Z^)  =p{Z^)p{Z^)...p{Z^).  Then 


F;p*[P(Z(*+i))-P(Z«)]  =  ^^,  Vi,l<i<iV-l.  (15) 


That  is,  the  expected  value  of  the  probability  mass  of  the  interval  between  two  successive 
elements  of  a  sample  from  a  random  variable^  is  just  of  the  total  probability  (which 
by  definition  must  be  equal  to  1.0).  This  surprisingly  general  fact  is  a  simple  consequence 
of  the  uniformity  of  the  random  variable  P{Z).  P{Z),  the  random  variable  describing  the 
“height”  on  the  cumulative  curve  of  a  random  draw  from  Z  (as  opposed  to  the  function 
P{z))  is  called  the  probability  integral  transform  of  Z  (c.f.  Manoukian  (1986)).  Thus,  the 
key  insight  is  that  the  intervals  between  successive  order  statistics  have  the  same  expected 
probability  mass. 

Using  this  idea,  one  can  develop  a  simple  entropy  estimator.  We  start  by  approximating 
the  probability  density  p{z)  by  assigning  equivalent  masses  to  each  interval  between  points 
and  assuming  a  uniform  distribution  of  this  mass  across  the  interval^.  Defining  Z^^^  to  be 
the  infimum  of  the  support  of  p{z)  and  defining  to  be  the  supremum  of  the  support 

of  p{z),  we  have: 


p{z-,Z\...,Z^)  = 


Af+l 


^(i+1)  _  Z(^) 


(16) 


2.  The  probability  mass  of  the  interval  between  two  successive  points  can  also  be  thought  of  as  the  integral 
of  the  density  function  between  these  two  points. 

3.  We  use  the  notion  of  a  density  estimate  to  aid  in  the  intuition  behinid  m— spacing  estimates  of  entropy. 
However,  as  we  stress  below,  density  estimation  is  not  a  necessary  intermediate  step  in  entropy  estimation. 
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for  Then,  we  can  write 


H{Z)  = 

(a) 


'  — oo 
r*oo 


p{z)  \ogp{z)dz 

/OQ 

p{z)  \ogp{z)dz 

-OO 

V  /  p(z)logp(2;)<i2; 

A'" 

E 

i=0 
N 

X]  ^(*+1)  _  ^(*)  ^(*+1)  _ 

i=0 


1 

TV+l 


1 


■log  ■ 


Af+1 


^(i+1)  _  ^(*)  ^  ^(i+1)  _  ^(*) 

1  1 


TV+l 


■log 


1  /-Zl® 

Af+1  / 

■^)  —  Zh)  7^(1) 


1 


AT 


J^log 


AT  +  l 


N  +  1^"'"  Zh+1)  -  Z(*) 

i=0 


(fe) 


1 


AT-l 


AT+l 


N  -I 
1 


iV-  1 


^  Zh+1)  -  Z(*) 

^  log  [{N  +  -  ZW)) 


2=1 
7V-1 


2=1 


(17) 

(18) 

(19) 

(20) 
(21) 
(22) 

(23) 

(24) 

(25) 


The  approximation  (a)  arises  by  approximating  the  true  density  p{z)  by  p{z;  , ...,  Z^). 
The  approximation  (6)  stems  from  the  fact  that  in  general  we  do  not  know  and 
i.e.  the  true  support  of  the  unknown  density.  Therefore,  we  form  the  mean  log  density  esti¬ 
mate  using  only  information  from  the  region  for  which  we  have  some  information,  ignoring 
the  intervals  outside  the  range  of  the  sample.  This  is  equivalent  to  assuming  that  outside 
the  sample  range,  the  true  density  has  the  same  mean  log  probability  density  as  the  rest  of 
the  distribution. 

2.2  Lowering  the  variance  of  the  estimate 

The  estimate  (25)  has  both  intuitive  and  theoretical  appeal^,  but  it  has  relatively  high 
variance  since  while  the  expectation  of  the  interval  probabilities  (15)  is  their  variance 
is  high.  The  upper  left  plot  of  Figure  1  shows  the  distribution  of  values  obtained  when 
we  divide  a  random  1— spacing  by  its  expected  value.  Clearly,  these  values  are  widely 
distributed  around  the  ideal  value  of  1. 


4.  The  addition  of  a  small  constant  renders  this  estimator  weakly  consistent  for  bonnded  densities  nnder 
certain  tail  conditions  (Hall  (1984)). 
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Figure  1:  Histograms  showing  the  variability  of  the  probability  mass  of  m-spacings,  as  a 
function  of  m.  Each  plot  shows,  for  a  particular  m,  the  ratio  of  a  set  of  random 
m-spacings  to  their  expected  values.  When  m  =  1  (upper  left  plot),  the  prob¬ 
ability  mass  of  the  m-spacings  is  widely  distributed.  Already  for  m  =  2  (upper 
right),  the  ratio  is  substantially  more  concentrated  around  its  expected  value  of 
1.  For  m  =  10  (lower  left),  the  m-spacings’  probability  masses  are  almost  always 
within  a  factor  of  three  of  their  expected  values.  For  m  =  100  (lower  right),  the 
probability  mass  of  the  m-spacings  is  highly  consistent.  This  behavior  is  a  simple 
consequence  of  the  law  of  large  numbers  and  the  uniformity  of  the  distribution 
obtained  through  the  probability  integral  transform. 
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This  problem  can  be  mitigated,  and  asymptotically  eliminated  completely,  by  consider¬ 
ing  m— spacing  estimates  of  entropy,  such  as 

N-l  ^ 

Hm-spac^ng{Z\  ...,  Z^)  =  ^  log  +  l .  (26) 

i=0  ^  ^  ^ 

By  letting 

7Tl 

oo,  —  ^0,  (27) 

this  estimator  also  becomes  consistent  (Vasicek  (1976),  Beirlant  et  al.  (1997)).  In  this  work, 
we  typically  set  m  =  '/N. 

The  intuition  behind  this  estimator  is  that  by  considering  m-spacings  with  larger  and 
larger  values  of  m,  the  variance  of  the  probability  mass  of  these  spacings,  relative  to  their 
expected  values,  gets  smaller  and  smaller.  This  behavior  is  illustrated  in  Figure  1.  Each  plot 
shows,  for  a  different  value  of  m,  the  distribution  of  the  ratio  between  random  m-spacings 
and  their  expected  value.  The  upper  left  plot  shows  that  for  m  =  1,  this  distribution  is 
distributed  very  widely.  As  m  grows,  the  probability  mass  for  each  m-spacing  concentrates 
around  its  expected  value. 

Such  plots,  which  are  functions  of  the  probablity  mass  of  intervals  defined  by  order 
statistics,  have  the  same  form  for  all  probability  distributions  with  continuous  cumulative 
distribution  functions.  That  is,  the  form  depends  only  on  the  value  of  m  and  not  at  all  on 
the  probability  law.  This  is  again  a  consequence  of  the  uniformity  in  distribution  of  the 
probability  integral  transform  for  any  (impulse-free)  density. 

A  modification  of  (26)  in  which  the  m— spacings  overlap^: 

Hradical{Z\...,Z^)^-^  ^log('^l±l(z(*+-)-Z«)V  (28) 

i=l  ^  ' 

is  used  in  RADICAL.  This  is  equivalent  asymptotically  to  the  estimator  (14)  of  Vasicek 
(1976).  Weak  and  strong  consistency  have  been  shown  by  various  authors  under  a  variety 
of  general  conditions  assuming  (27).  For  details  of  these  results,  see  the  review  paper  by 
Beirlant  et  al.  (1997).  Perhaps  the  most  important  property  of  this  estimator  is  that  it  is 
asymptotically  efficient,  as  shown  by  Levit  (1978). 

It  is  interesting  to  remark  that  while  (25)  and  (26)  have  a  natural  correspondence  to 
density  estimates  (if  we  ignore  the  region  outside  the  range  of  the  samples),  there  is  no 
trivial  correspondence  between  (28)  and  a  density  estimate.  We  are  thus  solving  the  entropy 
estimation  problem  without  demanding  that  we  solve  the  density  estimation  problem®. 

We  note  that  Pham  (2000)  defined  an  ICA  contrast  function  as  a  sum  of  terms  very 
similar  to  (26).  However,  by  choosing  m  =  1  as  was  done  in  that  work,  one  no  longer 
obtains  a  consistent  estimator  of  entropy,  and  the  efficiency  and  efficacy  of  (26)  as  an  ICA 
contrast  function  appears  to  be  greatly  reduced.  In  particular,  much  of  the  information 
about  whether  or  not  the  marginals  are  independent  is  ignored  in  such  an  approach. 

5.  Allowing  the  m-spacings  to  overlap  reduces  the  asymptotic  variance  of  the  estimator. 

6.  For  those  who  are  skeptical  that  this  is  possible,  we  suggest  that  it  is  no  different  than  estimating  the 

variance  of  a  random  variable  without  estimating  its  density. 


3.  RADICAL  in  Two  Dimensions 

Given  that  we  have  an  entropy  estimate  in  hand,  we  now  discuss  its  application  to  op¬ 
timizing  Equation  (13).  We  first  discuss  aspects  of  the  estimator  in  the  context  of  the 
two-dimensional  ICA  problem.  Later  we  will  extend  the  optimization  method  to  multiple 
dimensions.  Two  issues  which  arise  and  which  must  be  dealt  with  are  local  minima  and 
what  we  will  refer  to  as  “false  minima”.  The  first  issue  is  intrinsic  to  the  optimization 
criterion  and  appears  difficult  to  address  without  adopting  an  exhaustive  search  strategy. 
The  second  is  a  function  of  the  estimator  and  will  motivate  a  smoothing  approach.  We  ad¬ 
dress  these  issues  by  way  of  some  canonical  examples  before  proceeding  to  a  more  detailed 
discussion  of  the  algorithm. 

3.1  Canonical  empirical  examples 

We  use  three  canonical  examples-  separation  of  (1)  two  uniform  densities,  (2)  two  double¬ 
exponential  densities,  and  (3)  two  bi-modal  Gaussian  mixture  densities  -  in  which  we  ex¬ 
amine  150  equally  spaced  rotations  of  the  data  between  0  and  90  degrees.  Each  of  these 
examples  illustrates  various  aspects  of  the  estimator  and  the  optimization  procedure. 

Gonsider  Eigure  2  which  shows  some  results  from  separating  a  mixture  of  two  uniform 
densities.  Eigure  2(a)  shows  the  results  over  100  Monte  Garlo  trials  in  which  N  =  250  and 
m  =  16  ~  \/250.  As  can  be  seen,  the  mean  estimate  (over  90  degrees  of  rotation)  is  fairly 
smooth  with  a  clear  global  maximum  at  45  degrees  and  a  minimum  at  0  degrees  rotation 
(or  equivalently  90  degrees).  However,  not  surprisingly,  any  one  trial  has  several  false  local 
minima  and  maxima.  In  Figure  2(a),  for  example,  the  individual  trials  which  exhibited  the 
largest  positive  and  negative  deviation  from  the  average  (for  any  single  angle  9)  over  all 
trials  are  overlaid  on  the  average  result.  Similar  figures  for  the  other  two  cases  are  shown 
in  Figures  3(a)  and  4(a),  although  local  minima  and  maxima  (over  individual  trials))  are 
not  as  severe  in  these  cases. 

In  particular,  false  minima  can  become  quite  severe  when  the  sample  size  is  small.  They 
are  a  consequence  of  the  fact  that  the  m-spacings  estimator  makes  no  prior  smoothness  as¬ 
sumptions  (e.g.  limited  spatial  frequency)  regarding  the  underlying  densities.  Gonsequently, 
for  small  sample  size  there  exist  rotations  (instances  of  W)  for  which  portions  of  the  data 
spuriously  approximately  align,  producing  an  artificial  spike  in  one  of  the  marginal  distri¬ 
butions.  This  is  most  easily  understood  by  considering  the  case  in  which  m,  the  number  of 
intervals  combined  in  an  m-spacing,  is  equal  to  1.  In  this  case,  for  any  value  of  N  there  are 
many  rotations  (O(A^)  of  them,  in  fact)  which  will  cause  two  points  to  align  exactly,  either 
vertically  or  horizontally.  This  causes  the  1— spacing  corresponding  to  these  two  points  to 
have  width  0  in  one  of  the  empirical  marginal  distributions,  which  in  turn  gives  this  inter¬ 
val  an  average  logarithm  of  — oo.  This  results  in  a  marginal  entropy  estimate  of  — oo  for 
this  particular  rotation  of  the  data.  The  entropy  estimator  has  no  evidence  that  there  is 
not,  in  fact,  an  impulse  in  the  true  marginal  density  which  would  legitimately  indicate  a 
negatively  infinite  entropy,  so  it  is  not  a  fundamental  flaw  with  the  estimator.  Rather,  it  is 
a  consequence  of  allowing  arbitrarily  peaked  implicit  marginal  estimates.  While  this  issue 
becomes  less  of  a  problem  as  N  and  m  grow,  our  empirical  findings  suggest  that  for  the 
densities  considered  in  this  paper  (see  Figure  6),  it  must  be  addressed  to  achieve  optimal 
performance  at  least  while  N  <  2000. 
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Figure  2:  RADICAL  for  a  mixture  of  two  uniform  densities:  (a)  The  thick  solid  curve  is 
the  mean  estimate  of  Equation  (13)  over  100  Monte  Carlo  trials  with  no  data 
augmentation.  The  dotted  curves  indicate  plus  or  minus  one  standard  deviation, 
while  the  thinner  (less  smooth)  curves  are  the  two  trials  which  had  the  largest 
positive  and  negative  deviation  from  the  mean  respectively,  (b)  is  exactly  the 
same  as  (a)  except  that  the  data  set  was  augmented  with  R  =  30  and  a  smoothing 
factor  of  cJr  =  0.1  was  used,  (c)  is  one  realization  of  the  original  sources  with  no 
rotation,  (d)  with  25  degrees  rotation,  and  (e)  with  45  degrees  rotation.  Axes  of 
rotation  are  overlaid  on  the  plots. 
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Figure  3:  RADICAL  for  a  mixture  of  two  double-exponential  densities:  (a)  The  thick  solid 
curve  is  the  mean  estimate  of  Equation  (13)  over  100  Monte  Carlo  trials  with 
no  data  augmentation.  The  dotted  curves  indicate  plus  or  minus  one  standard 
deviation,  while  the  thinner  (less  smooth)  curves  are  the  two  trials  which  had 
the  largest  positive  and  negative  deviation  from  the  mean  respectively,  (b)  is 
exactly  the  same  as  (a)  except  that  the  data  set  was  augmented  with  R  =  30  and 
a  smoothing  factor  of  ar  =  0.1  was  used,  (c)  is  one  realization  of  the  original 
sources  with  no  rotation,  (d)  with  25  degrees  rotation,  and  (e)  with  45  degrees 
rotation.  Axes  of  rotation  are  overlaid  on  the  plots. 
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Figure  4:  RADICAL  for  a  mixture  of  two  bi-modal  densities:  (a)  The  thick  solid  curve  is 
the  mean  estimate  of  Equation  (13)  over  100  Monte  Carlo  trials  with  no  data 
augmentation.  The  dotted  curves  indicate  plus  or  minus  one  standard  deviation, 
while  the  thinner  (less  smooth)  curves  are  the  two  trials  which  had  the  largest 
positive  and  negative  deviation  from  the  mean  respectively,  (b)  is  exactly  the 
same  as  (a)  except  that  the  data  set  was  augmented  with  R  =  30  and  a  smoothing 
factor  of  cJr  =  0.1  was  used,  (c)  is  one  realization  of  the  original  sources  with  no 
rotation,  (d)  wth  25  degrees  rotation,  and  (e)  with  45  degrees  rotation.  Axes  of 
rotation  are  overlaid  on  the  plots. 
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To  address  this  problem,  we  consider  a  smoothed  version  of  the  estimator.  We  attempt 
to  remove  such  false  minima  by  synthesizing  R  replicates  of  each  of  the  original  N  sample 
points  with  additive  spherical  Gaussian  noise  to  make  a  surrogate  data  set  X' .  That  is,  each 
point  is  replaced  with  R  samples  from  the  distribution  N{X^,a‘^I),  where  R  and  ci^ 
become  parameters  of  the  algorithm.  We  refer  to  this  as  augmenting  the  data  set  X.  Then 
we  use  the  entropy  estimator  (28)  on  the  augmented  data  set  X' .  R  was  chosen  to  have  a 
value  of  30  for  most  of  the  experiments  in  this  report.  Results  of  this  modification  to  the 
estimator  are  shown  in  Figures  2(b),  3(b)  and  4(b).  While  not  completely  eliminating  the 
problem  of  local  minima  and  maxima,  the  results  over  the  worst  two  trials  are  significantly 
smoother. 

After  reducing  the  effect  of  false  minima  on  the  optimization  of  IT,  we  must  still  address 
legitimate  local  minima,  the  best  example  of  which  is  shown  in  Figure  4  where  a  rotation 
of  45°  is  a  true  local  minimum  of  the  objective  function.  For  two-dimensional  source  sep¬ 
aration,  taking  advantage  of  the  fact  that  W{9)  is  a  one-dimensional  manifold,  we  do  an 
exhaustive  search  over  IT  for  K  values  of  6.  Note  that  we  need  only  consider  9  in  the  inter¬ 
val  [0,  since  any  90  degree  rotation  will  result  in  equivalent  independent  components. 
In  all  of  our  experiments,  we  set  K  =  150.  Importantly,  it  turns  out  that  even  in  higher 
dimensions,  our  algorithm  will  remain  linear  in  K  (although  polynomially  more  expensive 
in  other  respects),  so  it  is  relatively  inexpensive  to  do  a  finer  grain  search  over  9  if  desired. 
Complexity  issues  will  be  discussed  in  more  detail  below. 

3.2  The  ICA  algorithm 

RADICAL  is  a  very  simple  algorithm.  Assuming  that  our  observed  data  have  already  been 
whitened,  there  are  really  only  two  remaining  steps.  The  first  is  to  generate  an  augmented 
data  set  X'  from  X  by  the  procedure  described  in  the  previous  section.  The  second  is 
to  minimize  the  cost  function  (13),  which  we  do  by  exhaustive  search.  Various  aspects  of 
RADICAL  are  summarized  in  Figure  5. 

There  are  four  parameters  with  which  we  experimented  informally  in  our  early  ex¬ 
periments.  The  first  parameter  m,  determines  the  number  of  intervals  combined  in  an 
m— spacing.  As  stated  above,  we  chose  m  =  ^/N  for  all  of  our  experiments,  which  guaran¬ 
tees  the  asymptotic  consistency  of  our  procedure  as  long  as  none  of  the  marginal  densities 
have  impulses.  When  condition  (27)  is  satisfied,  the  entropy  estimator  will  be  consistent 
and  should  perform  well  for  large  N .  For  small  A,  performance  can  be  improved  by  choos¬ 
ing  m  according  to  the  particulars  of  the  distribution,  but  since  the  distribution  is  unknown 
in  general,  we  avoided  this  and  chose  a  fixed  rule  for  m  as  a  function  of  N  for  all  of  our 
experiments. 

A  second  parameter  is  the  number  of  points  R  used  to  replace  each  original  point  X^ 
when  creating  the  augmented  data  set.  The  value  of  R  can  be  made  smaller  for  large  A,  and 
as  A  and  m  get  large  enough,  point  replication  is  entirely  unnecessary,  since  the  optimization 
landscape  will  eventually  become  smooth  (to  within  the  resolution  of  the  search  algorithm) . 
However,  at  A  =  4000,  the  largest  value  with  which  we  experimented,  point  replication  was 
still  necessary.  The  experiments  in  this  paper  all  used  a  value  of  i?  =  30,  irrespective  of  the 
original  sample  size  A. 
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Algorithm: 

Input: 

Parameters: 


Procedure: 


Output: 


RADICAL,  two-dimensional  version. 

Data  vectors  assumed  whitened. 

m\  Size  of  spacing.  Usually  equal  to  \/A- 
Noise  variance  for  replicated  points. 

R:  Number  of  replicated  points  per  original  data  point. 

K:  Number  of  angles  at  which  to  evaluate  cost  function. 

1.  Create  X'  by  replicating  R  points  with  Gaussian  noise  for  each 
original  point. 

2.  For  each  0,  rotate  the  data  to  this  angle  {Y  =  W{d)  *  X') 
and  evaluate  cost  function. 

3.  Output  the  W  corresponding  to  the  optimal  6. 

W,  the  demixing  matrix. 


Figure  5:  A  high-level  description  of  RADICAL  for  the  two-dimensional  ICA  problem. 


Next  we  examine  the  variance  of  the  R  added  points  for  each  of  the  N  points  XY 
As  expected,  from  informal  testing,  we  found  that  performance  was  somewhat  better  if 
we  allowed  the  variance  to  shrink  as  N  grew.  However,  performance  was  relatively  robust 
to  the  choice  of  this  parameter  and  we  chose  only  two  different  values  of  ar  for  all  of  our 
experiments.  For  N  <  1000,  we  set  ar  =  0.35  and  for  N  >=  1000,  we  halved  this  value, 
setting  ar  =  .175. 

The  only  remaining  parameter  for  RADICAL  in  two  dimensions  is  K,  the  number  of 
rotations  at  which  to  measure  the  objective  function.  In  informal  experiments,  we  tried 
values  of  50, 100, 150,  and  250.  There  was  no  noticable  improvement  in  performance  after 
for  K  >  150,  even  for  N  =  4000  and  the  higher  dimensional  tests.  Of  course,  with  N  large 
enough,  one  could  benefit  to  some  extent  by  increasing  K.  Since  in  two  dimensions,  the 
error  metric  (see  below)  is  proportional  to  the  difference  in  angle  between  the  estimated  6 
and  the  optimal  0,  it  is  easy  to  see  that  asymptotically,  the  expected  error  is  a  function  of 
K  and  is  approximately 

1  77 

ix  =  JL 

2K  4K' 

For  K  =  150,  this  asymptotic  expected  error  is  approximately  0.005,  a  number  small  enough 
so  that  this  is  only  a  minor  contribution  to  the  total  error  (see  experiments)  for  values  of 
N  considered  here.  Since  both  the  two-dimensional  and  higher-dimensional  versions  of 
RADICAL  are  linear  in  K,  it  is  relatively  inexpensive  to  increase  the  resolution  of  the 
exhaustive  search. 


3.3  Algorithmic  complexity 

An  upper  bound  on  the  algorithmic  complexity  of  RADICAL  in  two  dimensions  is  fairly 
straightforward  to  compute.  There  are  a  number  of  opportunities  for  speedup  which  we 
will  leave  for  future  work.  We  will  assume  for  this  analysis  and  for  the  higher  dimensional 
case  discussed  later  that  D,  the  dimension,  is  less  than  N,  the  sample  size. 

We  assume  that  the  data  has  been  whitened  to  begin  with.  Whitening  the  data  is 
0{D‘^N).  In  two  dimensions,  we  will  treat  D  as  a  constant,  so  this  gives  us  0{N). 


14 


Augmenting  the  data  set  with  R  noisy  copies  of  each  point  is  just  0{NR).  Let  N'  =  NR 
be  the  size  of  the  augmented  data  set.  Rotation  of  the  augmented  data  points  to  an 
angle  6  by  matrix  multiplication  is  at  most  0{D‘^N'),  but  again  for  fixed  D,  we  can  call 
a  constant,  so  this  reduces  to  0{N').  In  two  dimensions,  our  estimator  requires  two 
one-dimensional  sorts,  which  will  take  time  0{N'  \ogN')  and  two  sums  over  at  most  N' 
spacings,  which  is  0{N').  Thus,  evaluating  the  objective  function  once,  which  involves 
matrix  multiplication,  sorting,  and  summing,  takes  time  0{N')  +  0{N'  log  N')  +  0{N')  = 
0{N'  log  N').  Note  that  m,  the  spacing  size,  does  not  enter  into  the  complexity  of  evaluating 
the  objective  function. 

We  repeat  this  procedure  K  times  in  our  exhaustive  search.  This  gives  us  an  up¬ 
per  bound  of  0{KN'  \ogN')  for  the  minimization  of  the  objective  function.  For  the 
whole  algorithm,  including  whitening,  we  then  have  0{N)  +  0{KN'  \ogN')  =  0{N)  + 
0{KN R\og{N R))  =  0{KN R\og{N R))  as  the  final  complexity  for  the  two-dimensional 
algorithm.  As  mentioned  previously,  it  should  be  possible  to  reduce  i?  to  1  for  large  N ,  so 
technically,  we  can  claim  that  RADICAL  is  0{KN\ogN).  However,  for  moderate  and  low 
values  of  N ,  we  must  still  choose  i?  >  1,  and  so  we  include  it  in  our  complexity  analysis. 


3.4  Experiments  in  two  dimensions 

To  test  the  algorithm  experimentally,  we  performed  a  large  set  of  experiments,  largely  drawn 
from  the  comprehensive  tests  developed  by  Bach  and  Jordan  (2002).  Our  tests  included 
comparisons  with  FastICA  (Hyvarinen  and  Oja  (1997)),  the  JADE  algorithm  (Cardoso 
(1999a)),  the  extended  Infomax  algorithm  (Lee  et  al.  (1999b)),  and  two  versions  of  Kernel- 
ICA:  KCCA  and  KGV  (Bach  and  Jordan  (2002)). 

For  each  of  the  18  one-dimensional  densities  shown  in  Figure  6,  and  which  were  normal¬ 
ized  to  have  zero  mean  and  unit  variance,  the  following  experiments  were  performed.  Using 
a  density  g(-),  N  points  were  drawn  iid  from  the  product  distribution  q{-)q{-)-  The  points 
were  then  subjected  to  a  random  rotation  matrix  A  to  produce  the  input  X  for  the  algo¬ 
rithm^.  We  then  measured  the  “difference”  between  the  true  matrix  A  and  the  W  returned 
by  the  algorithm,  according  to  the  well-known  criterion,  due  to  Amari  et  al.  (1996): 


d{A,W) 


2D  ymaxj  \bij\  j  2D  ymaxj  \bij\ 


(30) 


where  bij  =  {AW~^)ij. 

Table  1  shows  the  mean  results  for  each  source  density  on  each  row,  with  N  =  250, 
the  number  of  input  points,  and  100  replications  of  each  experiment.  The  best  performing 
algorithm  on  each  row  is  shown  in  bold  face.  Note  that  RADICAL  performs  best  in  10 
of  18  experiments,  substantially  outperforming  the  second  best  in  many  cases.  The  mean 
performance  in  these  experiments  is  shown  in  the  row  labeled  mean,  where  RADICAL 
has  lower  error  than  all  other  algorithms  tested.  The  final  row  of  the  table  represents 
experiments  in  which  two  (generally  different)  source  densities  were  chosen  randomly  from 

7.  Alternatively,  we  could  have  applied  a  random  non-singular  matrix  to  the  data,  and  then  whitened  the 
data,  keeping  track  of  the  whitening  matrix.  For  the  size  of  N  in  this  experiment,  these  two  methods 
are  essentially  equivalent. 
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Figure  6:  Probability  density  functions  of  sources  with  their  kurtoses:  (a)  Student  with 
three  degrees  of  freedom;  (b)  double  exponential;  (c)  uniform;  (d)  Student  with 
five  degrees  of  freedom;  (e)  exponential;  (f)  mixture  of  two  double  exponen¬ 
tials;  (g)-(h)-(i)  symmetric  mixtures  of  two  Gaussians:  multimodal,  transitional 
and  unimodal;  (m)-(n)-(o)  symmetric  mixtures  of  four  Gaussians:  multimodal, 
transitional  and  unimodal;  (p)-(q)-(r)  nonsymmetric  mixtures  of  four  Gaussians: 
multimodal,  transitional  and  unimodal.  This  figure  and  code  to  sample  from 
these  distributions  was  generously  provided  by  Francis  Bach,  UG  Berkeley. 
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Table  1 


pdfs 

FastICA 

Jade 

Imax 

KCCA 

KGV 

RADICAL 

a 

8.9 

7.5 

56.3 

6.6 

5.7 

5.6 

b 

10.2 

9.3 

61.8 

8.4 

6.2 

7.0 

c 

4.4 

3.1 

18.4 

4.7 

4.3 

2.4 

d 

11.8 

10.0 

61.1 

13.1 

11.6 

12.6 

e 

8.1 

7.4 

67.7 

3.7 

3.1 

1.7 

f 

7.9 

5.5 

12.4 

3.6 

3.3 

2.0 

g 

3.9 

2.9 

18.1 

3.1 

2.9 

1.4 

h 

11.1 

8.2 

27.2 

10.8 

8.4 

12.1 

i 

18.5 

16.7 

37.6 

25.2 

23.2 

27.0 

j 

12.2 

12.8 

50.5 

3.1 

3.0 

1.7 

k 

14.1 

10.3 

30.2 

6.1 

5.2 

5.5 

1 

22.6 

16.4 

39.2 

10.6 

8.7 

11.7 

m 

8.2 

6.9 

29.5 

14.8 

12.3 

1.9 

n 

11.4 

9.7 

32.1 

16.3 

9.7 

3.9 

o 

8.7 

6.8 

23.7 

13.0 

9.4 

8.6 

p 

9.9 

6.7 

29.1 

7.7 

6.0 

2.6 

q 

35.8 

32.0 

39.1 

12.4 

9.4 

5.3 

r 

13.0 

9.5 

27.7 

9.7 

7.2 

8.9 

mean 

12.3 

10.1 

36.8 

9.6 

7.8 

6.8 

rand 

10.7 

8.5 

29.6 

8.3 

6.0 

5.8 

:  The  Amari  errors  (multiplied  by  100)  for  two-component  ICA  with  250  samples. 
For  each  pdf  (from  a  to  r),  averages  over  100  replicates  are  presented.  For  each 
distribution,  the  lowest  error  rate  is  shown  in  bold  face.  The  overall  mean  is 
calculated  in  the  row  labeled  mean.  The  rand  row  presents  the  average  over 
1000  replications  when  two  (generally  different)  pdfs  were  chosen  uniformly  at 
random  among  the  18  possible  pdfs. 
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pdfs 

FastICA 

Jade 

Imax 

KCCA 

KGV 

RADICAL 

4.4 

3.7 

1.8 

3.7 

3.0 

2.1 

5.8 

4.1 

3.4 

3.7 

2.9 

2.7 

2.3 

1.9 

2.0 

2.7 

2.4 

1.2 

6.4 

6.1 

6.9 

7.1 

5.7 

5.3 

4.9 

3.9 

3.2 

1.7 

1.5 

0.9 

3.6 

2.7 

1.0 

1.7 

1.5 

1.0 

g 

1.8 

1.4 

0.6 

1.5 

1.4 

0.6 

h 

5.1 

4.1 

3.1 

4.6 

3.6 

3.7 

i 

10.0 

6.8 

7.8 

8.3 

6.4 

8.3 

j 

6.0 

4.5 

50.6 

1.4 

1.3 

0.8 

k 

5.8 

4.4 

4.2 

3.2 

2.8 

2.7 

1 

11.0 

8.3 

9.4 

4.9 

3.8 

4.2 

m 

3.9 

2.8 

3.9 

6.2 

4.7 

1.0 

n 

5.3 

3.9 

32.1 

7.1 

3.0 

1.8 

o 

4.4 

3.3 

4.1 

6.3 

4.5 

3.4 

3.7 

2.9 

8.2 

3.6 

2.8 

1.1 

19.0 

15.3 

43.3 

5.2 

3.6 

2.3 

5.8 

4.3 

5.9 

4.1 

3.7 

3.2 

mean 

6.1 

4.7 

10.6 

4.3 

3.3 

2.6 

rand 

5.1 

4.1 

6.8 

3.1 

2.0 

2.1 

Table  2:  The  Amari  errors  (multiplied  by  100)  for  two-component  ICA  with  1000  samples. 

For  each  pdf  (from  a  to  r),  averages  over  100  replicates  are  presented.  The  overall 
mean  is  calculated  in  the  row  labeled  mean.  The  rand  row  presents  the  average 
over  1000  replications  when  two  (generally  different)  pdfs  were  chosen  uniformly 
from  random  among  the  18  possible  pdfs. 
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the  set  of  18  densities  to  produce  the  product  distribution  from  which  points  were  sampled. 
1000  replications  were  performed  using  these  randomly  chosen  distributions.  For  these 
experiments,  RADICAL  has  a  slight  edge  over  Kernel-ICA,  but  they  both  significantly 
outperform  the  other  methods. 

Table  2  shows  an  analogous  set  of  results  for  larger  data  sets,  with  N  =  1000.  Again, 
RADICAL  outperforms  the  other  algorithms  for  most  densities.  However,  Kernel-ICA 
outperforms  RADICAL  by  a  small  margin  in  the  randomized  experiments. 


3.5  Robustness  to  outliers 

Figure  7  shows  results  for  our  outlier  experiments.  These  experiments  were  again  replica¬ 
tions  of  the  experiments  performed  by  Bach  and  Jordan  (2002).  Following  Bach  and  Jordan, 
we  simulated  outliers  by  randomly  choosing  up  to  25  data  points  to  corrupt.  This  was  done 
by  adding  the  value  -|-5  or  -5  (chosen  with  probability  1/2)  to  a  single  component  in  each  of 
the  selected  data  points.  We  performed  100  replications  using  source  distributions  chosen 
uniformly  at  random  from  the  18  possible  sources. 

It  can  be  seen  that  RADICAL  is  uniformly  more  robust  to  outliers  than  all  other  methods 
in  these  experiments,  for  every  number  of  outliers  added. 


4.  RADICAL  in  D  dimensions 

Clearly  RADICAL  will  be  more  useful  if  it  can  be  applied  in  higher  dimensions  than  D  = 
2.  While  projections  and  rotations  of  high  dimensional  data  present  no  challenge,  one 
might  worry  that  our  objective  function  is  difficult  to  minimize,  especially  since  our  entropy 
estimator  is  not  differentiable.  It  is  known  all  too  well  that  exhaustive  search  in  more  than 
a  few  dimensions  is  infeasible,  as  its  complexity  is  0{N^),  where  N  must  be  large  to  insure 
accuracy. 

It  turns  out,  however,  that  for  the  ICA  problem,  the  minimization  can  still  be  ap¬ 
proached  in  an  “exhaustive”  manner.  Successive  minimizations  along  different  pairs  of 
dimensions  works  well.  That  is,  we  can  recast  the  D— dimensional  ICA  problem  as  a  series 
of  two-dimensional  ICA  problems,  which  we  can  solve  well.  Empirically,  we  show  that  for 
dimensions  as  high  as  16,  RADICAL  on  average  outperforms  or  performs  similarly  to  all 
other  algorithms  against  which  we  tested  it. 

4.1  Jacobi  rotations 

To  find  the  D— dimensional  rotation  matrix  W*  that  optimizes  (13)  in  D  dimensions,  we 
use  Jacobi  methods  such  as  those  used  to  solve  symmetric  eigenvalue  problems  (see  Golub 
and  Loan  (1996)).  The  basic  idea  is  to  rotate  the  augmented  data  X'  two  dimensions  at  a 
time  using  what  are  known  as  Jacobi  rotations^.  A  Jacobi  rotation  of  angle  6  for  dimensions 


8.  Jacobi  rotations  are  also  known  as  Givens  rotations. 
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Figure  7:  Robustness  to  outliers.  The  abcissa  displays  the  number  of  outliers  and  the 
ordinate  shows  the  Amari  error. 
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p  and  q  is  defined  as: 


1  •••  0  •••  0  •••  0 

0  •  •  •  cos(0)  •  •  •  —  sin(0)  •  •  •  0 

J{p,q,d)=  :  :  ,  (31) 

0  •  •  •  sin(0)  •  •  •  cos{9)  ■  ■  ■  0 

0  •••  0  •••  0  •••  1 


where  the  sines  and  cosines  appear  on  the  pth  and  gth  rows  and  columns  of  the  matrix. 
Since  a  Jacobi  rotation  leaves  all  components  of  an  H-dimensional  data  point  unchanged 
except  for  the  pth  and  gth  components,  optimizing  our  objective  function  (13)  reduces  to  a 
two-dimensional  ICA  problem  for  each  distinct  Jacobi  rotation. 

Algorithmically,  we  initialize  Y  to  our  augmented  data  set  X' ,  and  our  rotation  matrix 
W  to  the  identity  matrix.  After  optimizing  our  objective  function  for  a  pair  of  dimensions 
(p,  g) ,  we  update  Y : 

Ynev.  =  J{p,q,e*)Y,  (32) 

keeping  track  of  our  cumulative  rotation  matrix: 

Wne^  =  J{p,q,e*)W.  (33) 

Note  that  since  each  Jacobi  rotation  affects  only  two  components  of  Y,  this  is  an  0{2^NR)  = 
0{NR)  operation.  Thus,  full  scale  U— dimensional  rotations  need  never  be  done  (all  at 
once).  This  is  discussed  further  below. 

There  are  D{D  —  l)/2  distinct  Jacobi  rotations  (parameterized  by  9),  and  performing 
a  set  of  these  is  known  as  a  sweep.  Empirically,  performing  multiple  sweeps  improves  our 
estimate  of  W*  for  some  number  of  iterations,  and  after  this  point,  the  error  may  increase  or 
decrease  sporadically  near  its  smallest  value.  The  number  of  sweeps  S  becomes  an  additional 
parameter  for  multi-dimensional  RADICAL.  We  found  that  5  ~  D  provided  good  results 
for  all  of  our  multi-dimensional  experiments. 

4.2  Complexity  of  RADICAL  in  D  dimensions 

The  complexity  of  RADICAL  in  D  dimensions  is  again  straightforward.  Starting  with  the 
whitening  of  the  data,  we  must  now  spend  0{D‘^N)  time  on  this  step.  Of  course,  we  can 
no  longer  legitimately  treat  the  dimension  D  as  a  constant.  Producing  the  augmented  data 
set  X'  now  becomes  an  0{DNR)  procedure,  but  this  will  be  dominated  by  other  terms. 

A  single  sweep  through  D{D  —  1) /2  Jacobi  rotations  produces  a  complexity  increase  by  a 
factor  of  0{D^)  for  a  total  sweep  complexity  of  0{K{D‘^)N' log  N').  Recall  that  N'  =  NR 
is  the  size  of  the  augmented  data  set.  The  number  of  sweeps  necessary  for  convergence 
is  difficult  to  predict,  but  it  seldom  exceeded  the  dimension  D.  Including  the  number  of 
sweeps  in  the  complexity  gives  0{SK{D‘^)N'  log  N'). 

It  should  be  pointed  out  that  this  complexity  analysis  includes  the  optimization,  so  it 
should  be  compared  against  the  total  run  time  of  other  algorithms,  not  simply  the  time  to 
evaluate  the  objective  function. 
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Algorithm: 

Input: 

Parameters: 


Procedure: 


Output: 


RADICAL,  D-dimensional  version. 

Data  vectors  assumed  whitened. 

m\  Size  of  spacing.  Usually  equal  to  \/A- 
Noise  variance  for  replicated  points. 

R:  Number  of  replicated  points  per  original  data  point. 

K:  Number  of  angles  at  which  to  evaluate  cost  function. 

S:  Number  of  sweeps  for  Jacobi  rotations. 

1.  Create  X'  by  replicating  R  points  with  Gaussian  noise  for  each  original 
point. 

2.  For  each  of  S  sweeps  (or  until  convergence): 

a.  For  each  of  D{D  —  l)/2  Jacobi  rotations  for  dimensions  (p,  q): 

i.  Perform  2-D  RADICAL  optimization,  returning  optimal  J{p,q,6*). 

ii.  Update  Y  according  to  Ynew  =  J{p,Q,G*)Y. 

hi.  Update  W  according  to  Wmw  =  ■ 

3.  Output  final  W . 

W 


Figure  8:  A  high-level  description  of  RADICAL  for  D  dimensions. 


dims 

N 

#repl 

FastICA 

Jade 

Imax 

KGV 

RADICAL 

2 

250 

1000 

11 

9 

30 

5 

6 

1000 

1000 

5 

4 

7 

2 

2 

4 

1000 

100 

18 

13 

25 

11 

6 

4000 

100 

8 

7 

11 

4 

3 

8 

2000 

50 

26 

22 

123 

20 

11 

4000 

50 

18 

16 

41 

8 

8 

16 

4000 

25 

42 

38 

130 

19 

15 

Table  3:  Results  for  experiments  in  higher  dimensions.  The  table  shows  experiments  for 
dimensions  two  through  16.  The  number  of  points  used  for  each  experiment  is 
shown  in  the  second  column  and  the  number  of  experiment  replications  performed 
to  obtain  the  mean  values  at  right  is  given  in  the  third  column.  KGV  is  Kernel- 
ICA  using  the  kernel  generalized  variance.  Note  that  RADICAL  performed  best 
or  equal  to  best  in  all  but  one  experiment. 


4.3  Results  for  D  dimeusious 

Table  3  presents  results  of  experiments  for  multiple  dimensions.  In  each  experiment  for 
dimension  D,  D  (generally)  different  densities  were  selected  at  random  from  the  set  of  18 
densities  discussed  above.  Again  this  data  was  randomly  rotated,  and  the  task  was  to 
recover  the  independent  components.  Notice  that  RADICAL  is  either  the  best  or  second 
best  performer  in  each  case,  performing  better  than  all  other  algorithms  in  four  of  seven 
experiments. 
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5.  Conclusions 

We  have  presented  a  novel  algorithm,  RADICAL,  for  independent  component  analysis. 
Our  approach  was  predicated  on  several  principles.  First,  direct  estimation  of  entropy 
obviates  the  need  for  density  estimation  as  an  intermediate  step.  Second,  over  the  space  of 
smooth  densities  there  are  unavoidable  local  minima  in  the  commonly  used  K-L  divergence 
based  optimization  landscape.  This  necessitated  in  some  respects  a  global  search  over 
the  parameter  space  in  order  to  achieve  good  convergence  properties  over  a  broad  set  of 
source  densities.  Toward  that  end  we  employed  a  variant  of  the  nonparametric  entropy 
estimator  of  Vasicek  (1976)  which  is  both  computationally  efficient  and  robust.  In  addition, 
our  algorithm  is  easily  used  in  higher  dimensions.  Empirical  results  were  reported  for  a 
significant  number  of  2-D  separation  problems,  2-D  separation  with  outliers,  and  a  range 
of  multi-dimensional  separation  problems.  Our  empirical  results  demonstrated  comparable 
or  superior  results  (as  measured  by  the  Amari  error)  to  a  large  number  of  well  known 
algorithms. 

While  these  initial  results  are  promising,  there  is  still  room  for  improvement  in  the 
algorithms  as  presented  from  both  a  computational  and  theoretical  perspective.  On  the 
computational  side,  we  take  no  advantage  of  local  changes  in  sorting  order  due  to  local 
changes  in  rotation.  Consequently,  the  application  of  standard  sorting  algorithms  for  such 
scenarios  would  be  expected  to  greatly  reduce  the  computational  complexity  of  the  analysis. 
From  the  theoretical  perspective  we  presented  a  smoothed  variant  of  the  Vasicek  estimator. 
Smoothing  was  accomplished  via  Monte  Carlo  techniques  which  might  be  avoided  entirely 
(also  reducing  the  computational  complexity)  by  considering  alternative  methods  for  biasing 
the  entropy  estimate  for  smooth  densities.  Such  will  be  the  focus  of  our  future  efforts. 
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